{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting started with Pytorch 2.0 and Hugging Face Transformers\n",
    "\n",
    "On December 2, 2022, the PyTorch Team announced [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) at the PyTorch Conference, focused on better performance, being faster, more pythonic, and staying as dynamic as before. \n",
    "\n",
    "This blog post explains how to get started with PyTorch 2.0 and Hugging Face Transformers today. It will cover how to fine-tune a BERT model for Text Classification using the newest PyTorch 2.0 features. \n",
    "\n",
    "You will learn how to:\n",
    "\n",
    "1. Setup environment & install Pytorch 2.0 \n",
    "2. Load and prepare the dataset\n",
    "3. Fine-tune & evaluate BERT model with the Hugging Face `Trainer`\n",
    "4. Run Inference & test model\n",
    "\n",
    "Before we can start, make sure you have a **[Hugging Face Account](https://huggingface.co/join)** to save artifacts and experiments.\n",
    "\n",
    "## Quick intro: Pytorch 2.0\n",
    "\n",
    "PyTorch 2.0 or, better, 1.14 is entirely backward compatible. Pytorch 2.0 will not require any modification to existing PyTorch code but can optimize your code by adding a single line of code with `model = torch.compile(model)`.\n",
    "If you ask yourself, why is there a new major version and no breaking changes? The PyTorch team answered this question in their [FAQ](https://pytorch.org/get-started/pytorch-2.0/#faqs): *“We were releasing substantial new features that we believe change how you meaningfully use PyTorch, so we are calling it 2.0 instead.”* \n",
    "\n",
    "Those new features include top-level support for TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor. \n",
    "\n",
    "This allows PyTorch 2.0 to achieve a 1.3x-2x training time speedups supporting [today's 46 model architectures](https://github.com/pytorch/torchdynamo/issues/681) from [HuggingFace Transformers](https://github.com/huggingface/transformers)\n",
    "\n",
    "If you want to learn more about PyTorch 2.0, check out the official [“GET STARTED”](https://pytorch.org/get-started/pytorch-2.0/).\n",
    "\n",
    "---\n",
    "\n",
    "Now we know how PyTorch 2.0 works, let's get started. 🚀\n",
    "\n",
    "*Note: This tutorial was created and run on a g5.xlarge AWS EC2 Instance, including an NVIDIA A10G GPU.*\n",
    "\n",
    "## Setup environment & install Pytorch 2.0\n",
    "\n",
    "Our first step is to install PyTorch 2.0 and the Hugging Face Libraries, including `transformers` and `datasets`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install PyTorch 2.0\n",
    "!pip install \"torch>=2.0\" --extra-index-url https://download.pytorch.org/whl/cu117 --upgrade --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally, we are installing the latest version of `transformers` from the `main` git branch, which includes the native integration of PyTorch 2.0 into the `Trainer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install transformers and dataset\n",
    "!pip install \"transformers==4.27.1\" \"datasets==2.9.0\" \"accelerate==0.17.1\" \"evaluate==0.4.0\" tensorboard scikit-learn --upgrade --quiet\n",
    "# Install git-fls for pushing model and logs to the hugging face hub\n",
    "!sudo apt-get install git-lfs --yes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To push our model to the Hub, you must register on the [Hugging Face](https://huggingface.co/join). If you already have an account, you can skip this step. After you have an account, we will use the `login` util from the `huggingface_hub` package to log into our account and store our token (access key) on the disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "\n",
    "login(\n",
    "  token=\"\", # ADD YOUR TOKEN HERE\n",
    "  add_to_git_credential=True\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and prepare the dataset\n",
    "\n",
    "To keep the example straightforward, we are training a Text Classification model on the [BANKING77](https://huggingface.co/datasets/banking77) dataset. The BANKING77 dataset provides a fine-grained set of intents (classes) in a banking/finance domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection. ****\n",
    "\n",
    "We will use the `load_dataset()` method from the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the `banking77`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Found cached dataset banking77 (/home/ubuntu/.cache/huggingface/datasets/banking77/default/1.1.0/ff44c4421d7e70aa810b0fa79d36908a38b87aff8125d002cd44f7fcd31f493c)\n"
     ]
    },
    {
     "data": {
      "application/json": {
       "ascii": false,
       "bar_format": null,
       "colour": null,
       "elapsed": 0.006680011749267578,
       "initial": 0,
       "n": 0,
       "ncols": null,
       "nrows": null,
       "postfix": null,
       "prefix": "",
       "rate": null,
       "total": 2,
       "unit": "it",
       "unit_divisor": 1000,
       "unit_scale": false
      },
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "eb078f1a17c3458281a72d6a38246fe2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train dataset size: 10003\n",
      "Test dataset size: 3080\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# Dataset id from huggingface.co/dataset\n",
    "dataset_id = \"banking77\" \n",
    "\n",
    "# Load raw dataset\n",
    "raw_dataset = load_dataset(dataset_id)\n",
    "\n",
    "print(f\"Train dataset size: {len(raw_dataset['train'])}\")\n",
    "print(f\"Test dataset size: {len(raw_dataset['test'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s check out an example of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'text': 'Why am I getting declines when trying to make a purchase online?',\n",
       " 'label': 27}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from random import randrange\n",
    "\n",
    "random_id = randrange(len(raw_dataset['train']))\n",
    "raw_dataset['train'][random_id]\n",
    "# {'text': \"I can't get google pay to work right.\", 'label': 2}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train our model, we need to convert our \"Natural Language\" to token IDs. This is done by a Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary) if you want to learn more about this, out **[chapter 6](https://huggingface.co/course/chapter6/1?fw=pt)** of the [Hugging Face Course](https://huggingface.co/course/chapter1/1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "# Model id to load the tokenizer\n",
    "model_id = \"bert-base-uncased\"\n",
    "# Load Tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "\n",
    "# Tokenize helper function\n",
    "def tokenize(batch):\n",
    "    return tokenizer(batch['text'], padding='max_length', truncation=True, return_tensors=\"pt\")\n",
    "\n",
    "# Tokenize dataset\n",
    "raw_dataset =  raw_dataset.rename_column(\"label\", \"labels\") # to match Trainer\n",
    "tokenized_dataset = raw_dataset.map(tokenize, batched=True,remove_columns=[\"text\"])\n",
    "\n",
    "print(tokenized_dataset[\"train\"].features.keys())\n",
    "# dict_keys(['input_ids', 'token_type_ids', 'attention_mask','lable'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Fine-tune & evaluate BERT model with the Hugging Face `Trainer`\n",
    "\n",
    "After we have processed our dataset, we can start training our model. We will use the bert-base-uncased model. The first step is to load our model with `AutoModelForSequenceClassification` class from the [Hugging Face Hub](https://huggingface.co/bert-base-uncased). This will initialize the pre-trained BERT weights with a classification head on top. Here we pass the number of classes (77) from our dataset and the label names to have readable outputs for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoModelForSequenceClassification\n",
    "\n",
    "# Model id to load the tokenizer\n",
    "model_id = \"bert-base-uncased\"\n",
    "\n",
    "# Prepare model labels - useful for inference\n",
    "labels = tokenized_dataset[\"train\"].features[\"labels\"].names\n",
    "num_labels = len(labels)\n",
    "label2id, id2label = dict(), dict()\n",
    "for i, label in enumerate(labels):\n",
    "    label2id[label] = str(i)\n",
    "    id2label[str(i)] = label\n",
    "\n",
    "# Download the model from huggingface.co/models\n",
    "model = AutoModelForSequenceClassification.from_pretrained(\n",
    "    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We evaluate our model during training. The `Trainer` supports evaluation during training by providing a `compute_metrics` method. We use the `evaluate` library to calculate the [f1 metric](https://huggingface.co/spaces/evaluate-metric/f1) during training on our test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import evaluate\n",
    "import numpy as np\n",
    "\n",
    "# Metric Id\n",
    "metric = evaluate.load(\"f1\")\n",
    "\n",
    "# Metric helper method\n",
    "def compute_metrics(eval_pred):\n",
    "    predictions, labels = eval_pred\n",
    "    predictions = np.argmax(predictions, axis=1)\n",
    "    return metric.compute(predictions=predictions, references=labels, average=\"weighted\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step is to define the hyperparameters (`TrainingArguments`) we use for our training. Here we are adding the PyTorch 2.0 introduced features for fast training times. To use the latest improvements of PyTorch 2.0, we only need to pass the `torch_compile` option in the `TrainingArguments`.\n",
    "\n",
    "We also leverage the [Hugging Face Hub](https://huggingface.co/models) integration of the `Trainer` to push our checkpoints, logs, and metrics during training into a repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import HfFolder\n",
    "from transformers import Trainer, TrainingArguments\n",
    "\n",
    "# Id for remote repository\n",
    "repository_id = \"bert-base-banking77-pt2\"\n",
    "\n",
    "# Define training args\n",
    "training_args = TrainingArguments(\n",
    "    output_dir=repository_id,\n",
    "    per_device_train_batch_size=16,\n",
    "    per_device_eval_batch_size=8,\n",
    "    learning_rate=5e-5,\n",
    "\t\tnum_train_epochs=3,\n",
    "\t\t# PyTorch 2.0 specifics \n",
    "    bf16=True, # bfloat16 training \n",
    "\t\ttorch_compile=True, # optimizations\n",
    "    optim=\"adamw_torch_fused\", # improved optimizer \n",
    "    # logging & evaluation strategies\n",
    "    logging_dir=f\"{repository_id}/logs\",\n",
    "    logging_strategy=\"steps\",\n",
    "    logging_steps=200,\n",
    "    evaluation_strategy=\"epoch\",\n",
    "    save_strategy=\"epoch\",\n",
    "    save_total_limit=2,\n",
    "    load_best_model_at_end=True,\n",
    "    metric_for_best_model=\"f1\",\n",
    "    # push to hub parameters\n",
    "    report_to=\"tensorboard\",\n",
    "    push_to_hub=True,\n",
    "    hub_strategy=\"every_save\",\n",
    "    hub_model_id=repository_id,\n",
    "    hub_token=HfFolder.get_token(),\n",
    "\n",
    ")\n",
    "\n",
    "# Create a Trainer instance\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=tokenized_dataset[\"train\"],\n",
    "    eval_dataset=tokenized_dataset[\"test\"],\n",
    "    compute_metrics=compute_metrics,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can start our training by using the **`train`** method of the `Trainer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Start training\n",
    "trainer.train()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![tensorboard](../assets/tensorboard.png)\n",
    "\n",
    "Using Pytorch 2.0 and supported features in `transformers` allows us train our BERT model on `10_000` samples within `457.7964` seconds. \n",
    "\n",
    "We also ran the training without the `torch_compile` option to compare the training times. The training without `torch_compile` took 457 seconds, had a `train_samples_per_second` value of 65.55 and an `f1` score of `0.931`.\n",
    "\n",
    "```bash\n",
    "{'train_runtime': 696.2701, 'train_samples_per_second': 43.1, 'eval_f1': 0.928788}\n",
    "```\n",
    "\n",
    "By using the `torch_compile` option and the `adamw_torch_fused` optimized , we can see that the training time is reduced by 52.5% compared to the training without PyTorch 2.0. \n",
    "\n",
    "```bash\n",
    "{'train_runtime': 457.7964, 'train_samples_per_second': 65.55, 'eval_f1': 0.931773}\n",
    "```\n",
    "\n",
    "Our absoulte training time went down from 696s to 457. The `train_samples_per_second` value increased from 43 to 59. The `f1` score is the same/slighty better than the training without `torch_compile`.\n",
    "\n",
    "Pytorch 2.0 is incredible powerful! 🚀 \n",
    "\n",
    "Lets save our results and tokenizer to the Hugging Face Hub and create a model card."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save processor and create model card\n",
    "tokenizer.save_pretrained(repository_id)\n",
    "trainer.create_model_card()\n",
    "trainer.push_to_hub()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Run Inference & test model\n",
    "\n",
    "To wrap up this tutorial, we will run inference on a few examples and test our model. We will use the `pipeline` method from the `transformers` library to run inference on our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'label': 'card_arrival', 'score': 0.9903606176376343}]\n"
     ]
    }
   ],
   "source": [
    "from transformers import pipeline\n",
    "\n",
    "# load model from huggingface.co/models using our repository id\n",
    "classifier = pipeline(\"sentiment-analysis\", model=repository_id, tokenizer=repository_id, device=0)\n",
    "\n",
    "sample = \"I have been waiting longer than expected for my bank card, could you provide information on when it will arrive?\"\n",
    "\n",
    "\n",
    "pred = classifier(sample)\n",
    "print(pred)\n",
    "# [{'label': 'card_arrival', 'score': 0.9903606176376343}]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this tutorial, we learned how to use PyTorch 2.0 to train a text classification model on the BANKING77 dataset. We saw that PyTorch 2.0 is a powerful tool to speed up your training times. In our example running on a NVIDIA A10G we managed to achieve 52.5% better performance. The Hugging Face Trainer allows you to easily integrate PyTorch 2.0 into your training pipeline by simply adding the `torch_compile` option to the `TrainingArguments`. We can further benefit from PyTorch 2.0 by using the new fused AdamW optimizer when bf16 is available. \n",
    "\n",
    "Additionally, I want to mentioned that we reduced the training time by 52%, which could be interpreted in a cost saving of 52% for the training or in 52% faster iterations cycles and time to production. You should be able to see even better improvements by using A100 GPUs or by reducing the \"Trainer\" overhead, e.g. removing evaluation and logging. \n",
    "\n",
    "PyTorch 2.0 is now officially launched and we are excited to see what the future brings. 🚀"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pytorch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
